Figure 1: With the local representations extracted from Convolutional Neural Networks (CNNs), the ‘sand’ pixels (in the first image) are likely to be
misclassified as ‘road’, and the ‘building’ pixels (in the second image) are easily confused with ‘streetlight’. Our DAG-RNN is able to significantly boost
the discriminative power of local representations by modeling their contextual dependencies. As a result, it can produce smoother and more semantically
meaningful labeling maps. The figure is best viewed in color.


Abstract 

In image labeling, local representations for image units are
usually generated from their surrounding image patches, thus
long-range contextual information is not effectively encoded.
In this paper, we introduce recurrent neural networks (RNNs) to
address this issue. Specifically, directed acyclic graph RNNs
(DAG-RNNs) are proposed to process DAG-structured images, which
enables the network to model long-range semantic dependencies
among image units. Our DAG-RNNs substantially enhance the
discriminative power of local representations, which
significantly benefits local classification. Meanwhile, we
propose a novel class weighting function that attends to rare
classes, which markedly boosts the recognition accuracy for
infrequent classes. Integrated with convolution and
deconvolution layers, our DAG-RNNs achieve new state-of-the-art
results on the challenging SiftFlow, CamVid and Barcelona
benchmarks.


1. Introduction 

Scene labeling refers to assigning one of the semantic classes
to each pixel in a scene image. It is usually defined as a
multi-class classification problem in which each pixel is
classified based on its surrounding image patch. However, some
classes may be indistinguishable in a close-up view. For
example, in Figure 1 the ‘sand’ and ‘road’ pixels are hard to
distinguish even for humans when only limited context is
available. In contrast, their difference becomes conspicuous
when they are considered in the global scene. Thus, how to equip
local features with a broader view of contextual awareness is a
pivotal issue in image labeling.

In this paper, recurrent neural networks (RNNs) [?] are
introduced to address this issue by modeling the contextual
dependencies of local features. Specifically, we adopt
undirected cyclic graphs (UCGs) to model the interactions among
image units. Due to the loopy property of UCGs, RNNs are not
directly applicable to UCG-structured images. Thus, we decompose
the UCG into several directed acyclic graphs (DAGs; four DAGs
are used in our experiments). In other words, a UCG-structured
image is approximated by the combination of several
DAG-structured images. Then, we develop DAG-RNNs, a
generalization of RNNs [?], to process DAG-structured images.
Each hidden layer is generated independently by applying
DAG-RNNs to the corresponding DAG-structured image, and the
hidden layers are integrated to produce the context-aware
feature maps. In this way, the local representations are able to
embed the abstract gist of the image, so their discriminative
power is enhanced remarkably.

We integrate the DAG-RNNs with the convolution and deconvolution
layers, thus giving rise to an end-to-end trainable full
labeling network. Functionally, the convolution layer transforms
raw RGB pixels into compact and discriminative representations.
Based on them, the proposed DAG-RNNs model the contextual
dependencies of local features and output the improved
context-aware representation. The deconvolution layer upsamples
the feature maps to match the dimensionality of the desired
outputs. Overall, the full labeling network accepts variable-size
images and generates the corresponding dense label prediction
maps in a single feed-forward network pass. Furthermore,
considering that the class frequency distribution is highly
imbalanced in natural scene images, we propose a novel class
weighting function that attends to rare classes.

We test the proposed labeling network on three popular and
challenging scene labeling benchmarks (SiftFlow [?], CamVid [?]
and Barcelona [26]). On these datasets, we show that our
DAG-RNNs are capable of greatly enhancing the discriminative
power of local representations, which leads to large performance
improvements over baselines (CNNs, even the VGG-verydeep-16
network [23]). Meanwhile, the proposed class weighting function
is able to boost the recognition accuracy for rare classes. Most
importantly, our full labeling network significantly outperforms
current state-of-the-art methods.

Next, related work is reviewed, compared and discussed in
Section 2. Section 3 elaborates the details of the DAG-RNNs and
how they are applied to image labeling; it also presents the
details of the full labeling network and the class weighting
function. The detailed experimental results and analysis are
presented in Section 4. Finally, Section 5 concludes the paper.

2. Related Work 

Scene labeling (also termed scene parsing or semantic
segmentation) is one of the most challenging problems in
computer vision. It has attracted more and more attention in
recent years. Here we highlight and discuss three lines of work
that are most relevant to ours.

The first line of work explores contextual modeling. One attempt
is to encode context into local representations. For example,
Farabet et al. [7] stack surrounding contextual windows from
different scales; Pinheiro et al. [?] increase the size of input
windows. Sharma et al. [?] adopt recursive neural networks to
propagate global context to local regions. However, they do not
consider any structure for image units, thus their correlations
are not effectively captured. In contrast, we interpret the
image as a UCG, within which the connections allow the DAG-RNNs
to explicitly model the dependencies among image units. Another
attempt is to pass context to local classifiers by building
probabilistic graphical models (PGMs). For example, Shotton et
al. [?] formulate the unary and pairwise features in a
second-order Conditional Random Field (CRF). Zhang et al. [?]
and Roy et al. [?] build a fully connected graph to enforce
higher-order labeling coherence. Shuai et al. [21] model the
global-order dependencies in a non-parametric framework to
disambiguate the local confusions. Our work also differs from
them. First, the label dependencies are defined in terms of
compatibility functions in PGMs, while such dependencies are
modeled through a recurrent weight matrix in RNNs. Moreover, the
inference of PGMs is inefficient as the convergence of local
beliefs usually takes many iterations. In contrast, RNNs only
need a single forward pass to propagate the local information.

Some previous work exploits ‘recurrent’ ideas in a different
way. It generally refers to applying the identical model
recurrently at different iterations (layers). For example,
Pinheiro et al. [?] attach the raw RGB data to the output of the
Convolutional Neural Network (CNN) to produce the input for the
same CNN in the next layer. Tu et al. [28] augment the patch
feature with the output of the classifier to form the input for
the next iteration, and the classifier parameters are shared
across different iterations. Zheng et al. [35] transform
Conditional Random Fields (CRFs) into a neural network, so the
inference of the CRF amounts to applying the same neural network
recurrently until some fixed point (convergence) is reached. Our
work differs from them significantly. They model the context in
the form of intermediate outputs (usually local beliefs), which
implicitly encode the neighborhood information. In contrast, the
contextual dependencies are modeled explicitly in DAG-RNNs by
propagating information via the recurrent connections.

Recurrent neural networks (RNNs) have achieved great success in
temporal dependency modeling for chain-structured data, such as
natural language and speech. Zuo et al. [37] apply 1D-RNNs to
model weak contextual dependencies in image classification.
Graves et al. [?] generalize the 1D-RNN to the multi-dimensional
RNN (MDRNN) and apply it to offline Arabic handwriting
recognition. Shuai et al. [22] also adopt 2D-RNNs for real-world
image labeling. Recently, Tai et al. [25] and Zhu et al. [36]
demonstrated that considering tree structure (constituent /
parsing trees for sentences) is beneficial for modeling the
global representation of sentences. Our proposed DAG-RNN is a
generalization of chain-RNNs [?], tree-RNNs [25, 36] and 2D-RNNs
[?], and it enables the network to model long-range semantic
dependencies for graph-structured images. The most relevant work
to ours is [22]. In comparison with it, (1) we generalize the
2D-RNN to the DAG-RNN and show benefits in quantitative labeling
performance; (2) we integrate the convolution layer and
deconvolution layer with our DAG-RNNs into a full labeling
network; and (3) we adopt a novel class weighting function to
address the extremely imbalanced class distribution in natural
scene images. To the best of our knowledge, our work is the
first attempt to integrate convolution layers with RNNs in an
end-to-end trainable network for real-world image labeling.
Moreover, the proposed full network achieves state-of-the-art
performance on a variety of scene labeling benchmarks.


3. Approach 

To densely label an image I, the image is processed by three
functional layers sequentially: (1) The convolution layer
produces the corresponding feature map x. Each feature vector in
x summarizes the information from a local region in I. (2)
DAG-RNNs model the contextual dependencies among elements in x
and generate the intermediate feature map h, each element of
which is a feature vector that implicitly embeds the abstract
gist of the image. (3) The deconvolution layer [14] upsamples
the feature maps, from which the dense label prediction maps are
derived. We start by introducing the proposed DAG-RNNs, and the
details of the full network are elaborated in the following
sections.

3.1. RNNs Revisited 

A recurrent neural network (RNN) is a class of artificial neural
network that has recurrent connections, which equip the network
with memory. In this paper, we focus on the Elman-type network
[?]. Specifically, the hidden layer $h^{(t)}$ in RNNs at time
step $t$ is expressed as a non-linear function over the current
input $x^{(t)}$ and the hidden layer at the previous time step
$h^{(t-1)}$. The output layer $y^{(t)}$ is connected to the
hidden layer $h^{(t)}$.

Mathematically, given a sequence of inputs $\{x^{(t)}\}_{t=1:T}$,
an Elman-type RNN operates by computing the following hidden and
output sequences:

$$h^{(t)} = f(U x^{(t)} + W h^{(t-1)} + b)$$
$$y^{(t)} = g(V h^{(t)} + c) \qquad (1)$$

where $U$, $W$ are weight matrices between the input and hidden
layers, and among the hidden units themselves, while $V$ is the
output matrix connecting the hidden and output layers; $b$, $c$
are the corresponding bias vectors and $f(\cdot)$, $g(\cdot)$
are element-wise nonlinear activation functions. The initial
hidden unit $h^{(0)}$ is usually assumed to be 0. The local
information $x^{(t)}$ is progressively stored in the hidden
layers $h^{(t)}$ by applying Equation 1. In other words, the
contextual information (the summarization of past sequence
information) is explicitly encoded into the local representation
$h^{(t)}$, which improves its representative power dramatically
in practice.

Training an RNN can be achieved by optimizing a discriminative
objective with a gradient-based method. Back propagation through
time (BPTT) [30] is usually used to calculate the gradients.
This method is equivalent to unfolding the network in time and
using back propagation in a very deep feed-forward network,
except that the weights across different time steps (layers) are
shared.
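As a rough illustration of the Elman-type recurrence described above, the following sketch runs the forward pass in plain Python. The helper names (`matvec`, `elman_forward`) and the toy weights are assumptions made for illustration only; ReLU and softmax are chosen for f and g because Section 4.2 adopts them for the DAG-RNNs, not because the recurrence requires them.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def relu(v):
    return [max(0.0, x) for x in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def elman_forward(xs, U, W, V, b, c):
    """Return hidden and output sequences for the input sequence xs."""
    h = [0.0] * len(b)  # initial hidden state assumed to be 0
    hs, ys = [], []
    for x in xs:
        h = relu(add(matvec(U, x), matvec(W, h), b))  # hidden update
        hs.append(h)
        ys.append(softmax(add(matvec(V, h), c)))      # output layer
    return hs, ys


# Toy example (illustrative values only).
I2 = [[1.0, 0.0], [0.0, 1.0]]
W_toy = [[0.5, 0.0], [0.0, 0.5]]
hs, ys = elman_forward([[1.0, 0.0], [0.0, 1.0]], I2, W_toy, I2, [0.0, 0.0], [0.0, 0.0])
print(hs)  # the second hidden state carries half of the first input forward
```

In this toy run the second hidden state is `[0.5, 1.0]`: the first component is the decayed memory of the first input, which is the sense in which past context is stored in the hidden layer.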

3.2. DAG-RNNs 

The aforementioned RNN is designed for chain-structured data
(e.g. sentences or speech), where temporal dependency is
modeled. However, interactions among image units are beyond
chain. In other words, traditional chain-structured RNNs are not
suitable for images. Specifically, we can reshape the feature
tensor $x \in \mathbb{R}^{h \times w \times d}$ (an $h \times w$
map of $d$-dimensional feature vectors) to
$\hat{x} \in \mathbb{R}^{(h \cdot w) \times d}$ and generate the
chain representation by connecting contiguous elements in
$\hat{x}$. Such a structure loses the spatial relationship of
image units, as two adjacent units in the image plane are not
necessarily neighbors in the chain. Graphical representations
that respect the 2-D neighborhood system are more plausible
solutions, and they are pervasively adopted in probabilistic
graphical models (PGMs). Therefore, in this work, undirected
cyclic graphs (UCGs; an example is shown in Figure 2) are used
to model the interactions among image units.

Figure 2: An 8-neighborhood UCG and one of its induced DAGs in the
southeastern (SE) direction.

Due to the loopy structure of UCGs, they cannot be unrolled into
an acyclic processing sequence. Therefore, RNNs are not directly
applicable to UCG-structured images. To address this issue, we
approximate the topology of the UCG by a combination of several
directed acyclic graphs (DAGs), each of which can be processed
by our proposed DAG-RNNs (one of the induced DAGs is depicted in
Figure 2). Namely, a UCG-structured image is represented as the
combination of a set of DAG-structured images. We first
introduce the detailed mechanism of our DAG-RNNs, and elaborate
how they are applied to UCG-structured images in the next
section.

We first assume that an image $I$ is represented as a DAG
$\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$, where
$\mathcal{V} = \{v_i\}_{i=1:N}$ is the vertex set and
$\mathcal{E} = \{e_{ij}\}$ is the arc set ($e_{ij}$ denotes an
arc from $v_i$ to $v_j$). The structure of the hidden layer $h$
follows the same topology as $\mathcal{G}$. Therefore, a forward
propagation sequence can be generated by traversing
$\mathcal{G}$, on the condition that a node is not processed
until all its predecessors are processed. The hidden layer
$h^{(v_i)}$ is represented as a nonlinear function over its
local input $x^{(v_i)}$ and the summarization of the hidden
representations of its predecessors. The local input $x^{(v_i)}$
is obtained by aggregating (e.g. average pooling) the
constituent elements in the feature tensor $x$. In detail, the
forward operation of DAG-RNNs is calculated by the following
equations:

$$\hat{h}^{(v_i)} = \sum_{v_j \in \mathcal{P}_{\mathcal{G}}(v_i)} h^{(v_j)}$$
$$h^{(v_i)} = f(U x^{(v_i)} + W \hat{h}^{(v_i)} + b)$$
$$o^{(v_i)} = g(V h^{(v_i)} + c) \qquad (2)$$

where $x^{(v_i)}$, $h^{(v_i)}$, $o^{(v_i)}$ are the
representations of the input, hidden and output layers located
at $v_i$ respectively, $\mathcal{P}_{\mathcal{G}}(v_i)$ is the
direct predecessor set of vertex $v_i$ in the graph
$\mathcal{G}$, and $\hat{h}^{(v_i)}$ summarizes the information
of all the predecessors of $v_i$. Note that the recurrent weight
$W$ in Equation 2 is shared across all predecessor vertices in
$\mathcal{P}_{\mathcal{G}}(v_i)$. We may learn a specific
recurrent matrix $W$ for each predecessor when vertices (except
source and sink vertices) in the DAG $\mathcal{G}$ have a fixed
number of predecessors. In this case, a finer-grained dependency
may be captured.
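The forward operation above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the graph is given as a dictionary from each vertex to its direct predecessors, f is ReLU, the output layer is omitted, and the function names and toy weights are hypothetical.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def topological_order(preds):
    """Order vertices so that each one appears after all its predecessors."""
    remaining = {v: set(p) for v, p in preds.items()}
    order = []
    while remaining:
        ready = sorted(v for v, p in remaining.items() if not p)
        if not ready:
            raise ValueError("graph has a cycle, so it is not a DAG")
        for v in ready:
            order.append(v)
            del remaining[v]
        for p in remaining.values():
            p.difference_update(ready)
    return order


def dag_rnn_forward(x, preds, U, W, b):
    """Compute the hidden vector of every vertex of a DAG (ReLU activation)."""
    dim = len(b)
    h = {}
    for v in topological_order(preds):
        # Summarize the predecessors by summing their hidden vectors.
        h_hat = [sum(h[p][k] for p in preds[v]) for k in range(dim)]
        pre = [u + w + bias
               for u, w, bias in zip(matvec(U, x[v]), matvec(W, h_hat), b)]
        h[v] = [max(0.0, a) for a in pre]
    return h


# Diamond-shaped toy DAG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
preds = {0: [], 1: [0], 2: [0], 3: [1, 2]}
x = {v: [1.0] for v in preds}
h = dag_rnn_forward(x, preds, U=[[1.0]], W=[[0.5]], b=[0.0])
print(h[3])  # [2.5]: the sink aggregates both of its predecessors
```

Because the same `W` multiplies the summed predecessor state, the sink vertex receives information from the source through both paths in a single pass.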

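The derivatives are computed in the backward pass, and each
vertex is processed in the reverse order of the forward
propagation sequence. Specifically, to derive the gradients at
$v_i$, we look at the equations (besides Equation 2) that
involve $h^{(v_i)}$ in the forward pass:

$$\forall v_k \in \mathcal{S}_{\mathcal{G}}(v_i): \quad h^{(v_k)} = f\Big(U x^{(v_k)} + W h^{(v_i)} + W \sum_{v_j \in \mathcal{P}_{\mathcal{G}}(v_k) - \{v_i\}} h^{(v_j)} + b\Big) \qquad (3)$$

where $\mathcal{S}_{\mathcal{G}}(v_i)$ is the direct successor
set of vertex $v_i$ in the graph $\mathcal{G}$. It can be
inferred from Equation 3 that the errors backpropagated to the
hidden layer ($dh^{(v_i)}$) at $v_i$ have two sources: direct
errors from $v_i$
($\partial o^{(v_i)} / \partial h^{(v_i)}$), and the summation
over indirect errors propagated from its successors. The
derivatives at $v_i$ can then be computed by the following
equations:^1

$$\Delta V^{(v_i)} = g'(o^{(v_i)}) \, (h^{(v_i)})^T$$
$$dh^{(v_i)} = V^T g'(o^{(v_i)}) + \sum_{v_k \in \mathcal{S}_{\mathcal{G}}(v_i)} W^T \big( dh^{(v_k)} \circ f'(h^{(v_k)}) \big)$$
$$\Delta W^{(v_i)} = \sum_{v_k \in \mathcal{S}_{\mathcal{G}}(v_i)} \big( dh^{(v_k)} \circ f'(h^{(v_k)}) \big) \, (h^{(v_i)})^T$$
$$\Delta U^{(v_i)} = \big( dh^{(v_i)} \circ f'(h^{(v_i)}) \big) \, (x^{(v_i)})^T \qquad (4)$$

where $\circ$ denotes the Hadamard product, $g'(\cdot)$ is the
derivative of the loss function $L$ with respect to the output
function $g$, and $f'(\cdot)$ is the derivative of the hidden
activation function $f$. It is the second term of $dh^{(v_i)}$
in Equation 4 that enables DAG-RNNs to propagate local
information, which behaves similarly to message passing [32] in
probabilistic graphical models.

^1 To save space, we omit the expressions for $\Delta b$ and $\Delta c$ here as they can
be inferred trivially from Equation 4.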
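
The term that carries errors back from the successors follows from the chain rule applied to the forward pass. The short derivation below is a sketch under one assumption: $a^{(v_k)}$ is a symbol introduced here for illustration, denoting the pre-activation of the hidden layer at $v_k$.

```latex
a^{(v_k)} = U x^{(v_k)} + W \sum_{v_j \in \mathcal{P}_{\mathcal{G}}(v_k)} h^{(v_j)} + b,
\qquad h^{(v_k)} = f\big(a^{(v_k)}\big)

\frac{\partial h^{(v_k)}}{\partial h^{(v_i)}}
  = \operatorname{diag}\!\big(f'(a^{(v_k)})\big)\, W
  \qquad \text{for every } v_k \in \mathcal{S}_{\mathcal{G}}(v_i)

\frac{\partial L}{\partial h^{(v_i)}}
  = \underbrace{\Big(\frac{\partial o^{(v_i)}}{\partial h^{(v_i)}}\Big)^{T}
      \frac{\partial L}{\partial o^{(v_i)}}}_{\text{direct error at } v_i}
  + \underbrace{\sum_{v_k \in \mathcal{S}_{\mathcal{G}}(v_i)}
      W^{T} \Big( f'(a^{(v_k)}) \circ \frac{\partial L}{\partial h^{(v_k)}} \Big)}_{\text{errors from successors}}
```

The second term is a sum over direct successors only; errors from more distant vertices arrive because each successor's own error already contains the contributions of its successors.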



Figure 3: The architecture of the full labeling network, which consists of 
three functional layers: (1), convolution layer: it produces discriminative 
feature maps; (2), DAG-RNN: it models the contextual dependency among 
elements in the feature maps; (3), deconvolution layer: it upsamples the 
feature maps to output the desired sizes of label prediction maps. 


3.3. Decomposition 

We decompose the UCG $\mathcal{U}$ into a set of DAGs
$\mathcal{G}^{\mathcal{U}} = \{\mathcal{G}_1, \ldots, \mathcal{G}_d, \ldots\}$.
Hence, the UCG-structured image is represented as the
combination of a set of DAG-structured images. Next, DAG-RNNs
are applied independently to each DAG-structured image, and the
corresponding hidden layer $h_d$ is generated. The aggregation
of the independent hidden layers yields the output layer $o$.
These operations can be mathematically expressed as follows:

$$h_d^{(v_i)} = f\Big(U_d x^{(v_i)} + \sum_{v_j \in \mathcal{P}_{\mathcal{G}_d}(v_i)} W_d h_d^{(v_j)} + b_d\Big)$$
$$o^{(v_i)} = g\Big(\sum_{\mathcal{G}_d \in \mathcal{G}^{\mathcal{U}}} V_d h_d^{(v_i)} + c\Big) \qquad (5)$$

where $U_d$, $W_d$, $V_d$ and $b_d$ are the weight matrices and
bias vector for the DAG $\mathcal{G}_d$, and
$\mathcal{P}_{\mathcal{G}_d}(v_i)$ is the direct predecessor set
of vertex $v_i$ in $\mathcal{G}_d$. This strategy is reminiscent
of the tree-reweighted max-product algorithm (TRW) [29], which
represents the problem on loopy graphs as a convex combination
of tree-structured problems.

We consider the following criteria for the decomposition.
Topologically, the combination of DAGs should be equivalent to
the UCG $\mathcal{U}$, so that any two vertices are reachable
from each other. Besides, the combination of DAGs should allow
the local information to be routed to anywhere in the image. In
our experiments, we use the four context propagation directions
(southeast, southwest, northwest and northeast) suggested by [?]
to decompose the UCG. One example of the induced DAG of the
8-neighborhood UCG in the southeast direction is shown in
Figure 2.
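
The decomposition criteria can be checked on a small grid. The sketch below assumes that the DAG induced in a given direction keeps, for each vertex, the arcs coming from its vertical, horizontal and (for the 8-neighborhood) diagonal neighbor on the upstream side; this reading of the figure is an assumption, as are the function names. Under that assumption, the union of the four directional DAGs recovers every edge of the 8-neighborhood UCG in both orientations.

```python
# (row step, column step) of the propagation direction; rows grow southwards.
DIRECTIONS = {"SE": (1, 1), "SW": (1, -1), "NW": (-1, -1), "NE": (-1, 1)}


def predecessors(r, c, rows, cols, direction, neighborhood=8):
    """Direct predecessors of grid vertex (r, c) in one directional DAG."""
    dr, dc = DIRECTIONS[direction]
    offsets = [(-dr, 0), (0, -dc)]      # vertical and horizontal upstream
    if neighborhood == 8:
        offsets.append((-dr, -dc))      # diagonal upstream neighbor
    return [(r + a, c + b) for a, b in offsets
            if 0 <= r + a < rows and 0 <= c + b < cols]


def dag_arcs(rows, cols, direction, neighborhood=8):
    """All arcs (predecessor, vertex) of one directional DAG."""
    return {(p, (r, c))
            for r in range(rows) for c in range(cols)
            for p in predecessors(r, c, rows, cols, direction, neighborhood)}


def ucg_arcs(rows, cols):
    """Every 8-neighborhood edge of the grid, listed in both orientations."""
    arcs = set()
    for r in range(rows):
        for c in range(cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr, dc) != (0, 0) and 0 <= rr < rows and 0 <= cc < cols:
                        arcs.add(((r, c), (rr, cc)))
    return arcs


union = set().union(*(dag_arcs(3, 3, d) for d in DIRECTIONS))
print(union == ucg_arcs(3, 3))  # True: the four DAGs jointly cover the UCG
```

Each directional DAG is acyclic because every arc strictly advances along the propagation direction, while the union restores two-way connectivity between all neighboring units.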

3.4. Full Labeling Network 

The skeleton architecture of the full labeling network is
illustrated in Figure 3. The network is end-to-end trainable,
and it takes raw RGB images of any size as input. It outputs
label prediction maps of the same size as the inputs.

The convolution layer is used to produce compact yet highly
discriminative features for local regions. Next, the proposed
DAG-RNN is used to model the semantic contextual dependencies of
local representations. Finally, the deconvolution layer [14] is
introduced to upsample the feature maps by learning a set of
deconvolution filters, and it enables the full labeling network
to produce label prediction maps of the desired size.

Figure 4: Graphical visualization of the class frequencies (left) and weights
(right) on the SiftFlow dataset [?]. The classes are sorted in descending
order of their occurrence frequencies in the training images.

To train the network, we adopt the average weighted cross
entropy loss. It is formally written as:

$$L = -\frac{1}{N} \sum_{v_i \in I} \sum_{j=1}^{C} w_j \, y_j^{(v_i)} \log o_j^{(v_i)} \qquad (6)$$

where $N$ is the number of image units in image $I$; $C$ is the
number of classes; $w$ is the class weight vector, in which
$w_j$ stands for the weight for class $j$; $y^{(v_i)}$ is the
binary label indicator vector for the image unit located at
$v_i$, and $o^{(v_i)}$ stands for the corresponding class
likelihood vector. The errors propagated from DAG-RNNs to the
convolution layer for image unit $v_i$ are calculated based on
the following equation:

$$\Delta x^{(v_i)} = \sum_{\mathcal{G}_d \in \mathcal{G}^{\mathcal{U}}} U_d^T \big( dh_d^{(v_i)} \circ f'(h_d^{(v_i)}) \big) \qquad (7)$$

3.5. Attention to Rare Classes

In scene images, the class distribution is extremely imbalanced.
Namely, very few classes account for a large percentage of the
pixels in images. An example is shown in Figure 4. It is
therefore common to pay more attention to rare classes, in order
to boost their recognition precision.

In patch-based CNN training, Farabet et al. [7] and Shuai et al.
[21] oversample the rare-class pixels to address this issue.
However, this strategy is inapplicable to our network training,
which is a complex structure learning problem. Meanwhile, as the
classes are distributed severely unequally in scene images, it
is also problematic to weigh classes according to their inverse
frequencies. As an example, the frequency ratio between the most
frequent (sky) and the rarest class (moon) on the SiftFlow
dataset is 3.5 X 10^. If the above class weighting criterion is
adopted as in [?], the frequent classes will be under-attended.
Hence, we define the weighting function $w$ as follows:

$$w_j = \kappa^{\lceil \log_{10}(\eta / f_j) \rceil} \qquad (8)$$

where $\lceil \cdot \rceil$ is the integer ceiling operator,
$f_j$ is the occurrence frequency of class $j$, and $\eta$
denotes the threshold that discriminates the rare classes.
Specifically, a class is identified as rare if its frequency is
smaller than $\eta$; otherwise, it is a frequent class. $\kappa$
is a constant that controls the importance of rare classes
($\kappa = 2$ in our experiments). The proposed weighting
function has the following properties: (1) it attends to rare
classes by assigning them higher weights; (2) the degree of
attention for rare classes grows exponentially based on their
ratio magnitudes w.r.t. the threshold $\eta$. The following
criterion is used to determine the value of $\eta$: the
accumulated frequency of all the non-rare classes is 85%. We
call it the 85%-15% rule, and [21] uses a similar rule.

4. Experiments

We evaluate our method on three popular and challenging
real-world scene image labeling benchmarks: SiftFlow [?], CamVid
[?] and Barcelona [26]. Two types of scores are reported: the
percentage of all correctly classified pixels (Global), and the
average per-class accuracy (Class).

4.1. Baselines

The convolutional neural network (CNN), which jointly learns
features and classifiers, is used as our first baseline. In this
case, the parameters are optimized to maximize the independent
prediction accuracy for local patches. Another baseline is the
network that shares the same architecture as our DAG-RNNs but
with the recurrent connections removed. Mathematically, $W_d$
and $b_d$ in Equation 5 are fixed to 0. In this case, the
DAG-recurrent neural network degenerates to an ensemble of four
plain two-layer neural networks (CNN-ENN). The performance
disparity between the baselines and DAG-RNNs clearly illuminates
the efficacy of our dependency modeling method.
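
The ENN baseline can be read as the special case of the hidden-layer update in which the recurrent term vanishes. The sketch below is only an illustration: `hidden_update` is a hypothetical helper, ReLU is assumed for f, and the bias is set to zero in line with the description above.

```python
def hidden_update(x_v, h_preds, U, W, b):
    """One hidden update with ReLU: f(U x + W * sum(predecessor states) + b)."""
    dim = len(b)
    h_hat = [sum(h[k] for h in h_preds) for k in range(dim)]
    out = []
    for k in range(dim):
        a = b[k]
        a += sum(U[k][j] * x_v[j] for j in range(len(x_v)))
        a += sum(W[k][j] * h_hat[j] for j in range(dim))
        out.append(max(0.0, a))
    return out


U = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]
x_v = [1.0, 2.0]
context = [[5.0, 5.0], [3.0, 1.0]]          # hidden states of two predecessors

W_zero = [[0.0, 0.0], [0.0, 0.0]]           # ENN baseline: no recurrence
W_rec = [[0.5, 0.0], [0.0, 0.5]]            # DAG-RNN: context is propagated

print(hidden_update(x_v, context, U, W_zero, zero_bias))  # [1.0, 2.0]
print(hidden_update(x_v, context, U, W_rec, zero_bias))   # [5.0, 5.0]
```

With the recurrent matrix fixed to zero, the result depends on the local input alone, so any accuracy gap between the two settings can be attributed to the propagated context.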

4.2. Implementation Details 

We use the following two networks to be the convolution 
layers in our experiments: 

• CNN-65: The network consists of five convolutional
layers, the kernel sizes of which are 8 × 8 × 3 × 64,
6 × 6 × 64 × 128, 5 × 5 × 128 × 256, 4 × 4 × 256 × 256
and 1 × 1 × 256 × 64 respectively. Each of the first three
convolutional layers is followed by a ReLU and a
non-overlapping 2 × 2 max pooling layer. The parameters
of this network are learned from image patches (65 × 65)
of the target dataset only (Setting 1).

• VGG-conv5: The network borrows its architecture
and parameters from the VGG-verydeep-16 net [23]. In
detail, we discard all the layers after the 5th pooling
layer to yield the desired convolution layer. The
network is pre-trained on the ImageNet dataset and fine-tuned
on the target dataset (Setting 2).
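
A quick receptive-field calculation suggests why 65 × 65 patches fit CNN-65. The sketch assumes unpadded, stride-1 convolutions and stride-2 pooling (strides and padding are not stated above, so these are assumptions), and the function name is illustrative.

```python
def receptive_field(layers):
    """Return (receptive field, cumulative stride) for a stack of layers.

    Each layer is a (kernel size, stride) pair, applied in order.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) * jump
        jump *= stride              # pooling increases the step between units
    return rf, jump


# CNN-65: conv 8, pool 2, conv 6, pool 2, conv 5, pool 2, conv 4, conv 1.
CNN65 = [(8, 1), (2, 2), (6, 1), (2, 2), (5, 1), (2, 2), (4, 1), (1, 1)]
print(receptive_field(CNN65))  # (65, 8)
```

Under these assumptions each output unit sees exactly a 65 × 65 input patch, and neighboring output units are 8 input pixels apart.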

In DAG-RNNs, the adopted non-linear functions (refer to
Equation 5) are ReLU [?] for hidden neurons, f(x) = max(0, x),
and softmax for the output layer g. In practice, we apply the
function g after the deconvolution layer. The dimensionality of
the hidden layer h is empirically set to 64 for CNN-65 and 128
for VGG-conv5 respectively.^2 In our experiments, we consider
two UCGs with 4- and 8-neighborhood systems. Their induced DAGs
in the northwestern direction are shown in Figure 5. In
comparison with DAG(4), DAG(8) enables information to be
propagated along shorter paths, which is critical to prevent the
long-range information from vanishing. As illustrated in
Figure 5, the length of the propagation path from v_9 to v_1 in
DAG(8) is half of that in DAG(4) (4 → 2 steps).

Figure 5: Two UCGs (with 4- and 8-neighborhood systems) and their induced
DAGs in the northwestern (NW) direction.
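
The path-length comparison can be reproduced with a breadth-first search. The sketch assumes a 3 × 3 grid whose vertices are numbered row by row, so that v_1 is the top-left and v_9 the bottom-right unit (an assumption about the figure), and that the NW-direction DAG lets information move up, left and, for the 8-neighborhood, diagonally up-left.

```python
from collections import deque


def shortest_path_steps(rows, cols, neighborhood):
    """Steps from the bottom-right to the top-left vertex in the NW DAG."""
    moves = [(-1, 0), (0, -1)]          # up and left
    if neighborhood == 8:
        moves.append((-1, -1))          # diagonal move, 8-neighborhood only
    start, goal = (rows - 1, cols - 1), (0, 0)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in moves:
            nxt = (r + dr, c + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and nxt not in dist:
                dist[nxt] = dist[(r, c)] + 1
                queue.append(nxt)
    return dist[goal]


print(shortest_path_steps(3, 3, 4))  # 4
print(shortest_path_steps(3, 3, 8))  # 2
```

On a square n × n grid the same search gives 2(n − 1) steps for DAG(4) and n − 1 for DAG(8), so the halving holds for corner-to-corner propagation on larger feature maps as well.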

The full network is trained by stochastic gradient descent 
with momentum. The parameters are updated after one im¬ 
age finishes its forward and backward passes. The learning 
rate is initialized to be 10“^, and decays exponentially with 
the rate of 0.9 after 10 epochs. The reported results are based
on the model trained for 35 epochs. We tune the parameters
and diagnose the network performance based on CNN-65.
We also include the results of VGG-conv5 to see whether
our proposed DAG-RNNs are beneficial for the highly
discriminative representation from the state-of-the-art
VGG-verydeep-16 net [23].

4.3. SiftFlow Dataset 

The SiftFlow dataset has 2688 images generally captured from 8
typical outdoor scenes. Every image has 256 × 256 pixels, each
of which belongs to one of 33 semantic classes. We adopt the
training/testing split protocol (2488/200 images) provided by
[?] to perform our experiments. Following the 85%-15% criterion,
the class frequency threshold is η = 0.05. Statistically, out of
33 classes, 27 are regarded as infrequent classes. The graphical
visualization of the weights for different classes is depicted
in Figure 4.
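
The sketch below illustrates how a threshold, class weights and the weighted loss could be computed. It rests on several assumptions that go beyond the text: the weighting is taken to have the form w_j = κ^⌈log10(η / f_j)⌉, the 85%-15% rule is implemented by taking η as the frequency of the class at which the cumulative frequency (in descending order) first reaches 85%, and the class frequencies are invented toy values rather than SiftFlow statistics.

```python
import math


def rare_threshold(freqs, coverage=0.85):
    """Frequency of the class at which cumulative frequency reaches coverage."""
    total = 0.0
    for f in sorted(freqs, reverse=True):
        total += f
        if total >= coverage:
            return f
    return min(freqs)


def class_weights(freqs, eta, kappa=2.0):
    """Assumed weighting: kappa ** ceil(log10(eta / f_j)) for each class j."""
    return [kappa ** math.ceil(math.log10(eta / f)) for f in freqs]


def weighted_cross_entropy(labels, probs, weights):
    """Average of -w_j * log(o_j) over image units, j being the true class."""
    losses = [-weights[j] * math.log(p[j]) for j, p in zip(labels, probs)]
    return sum(losses) / len(losses)


# Toy class frequencies (hypothetical, summing to 1).
freqs = [0.5, 0.3, 0.1, 0.06, 0.03, 0.008, 0.002]
eta = rare_threshold(freqs)
weights = class_weights(freqs, eta)
print(eta)      # 0.1
print(weights)  # [1.0, 1.0, 1.0, 2.0, 2.0, 4.0, 4.0]

# Two image units: one frequent class predicted perfectly, one rare class
# predicted with probability 0.5.
probs = [[1.0, 0, 0, 0, 0, 0, 0], [0.5, 0, 0, 0, 0, 0.5, 0]]
loss = weighted_cross_entropy([0, 5], probs, weights)
```

In this toy setting a class whose frequency is more than ten times below the threshold receives weight 4 instead of 2, which is the stepwise exponential growth described in Section 3.5.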

The quantitative results are listed in Table 1, within which the
upper part presents the performance of methods under setting 1.
Our baseline CNN-65 achieves very promising results, which
demonstrates the effectiveness of the convolution layer. We also
notice that CNN-65 falls behind CNN-65-ENN on the average class
accuracy. This phenomenon is also observed on the CamVid and
Barcelona benchmarks, as shown in Tables 2 and 3 respectively.
This result indicates that the proposed class weighting function
significantly boosts the recognition accuracy for rare classes.
By adding DAG-RNN(8), our full network reaches 81.1% (48.2%) on
the global (class) accuracy^3, which outperforms the baseline
(CNN-65-ENN) by 5% (11.2%). Meanwhile, we observe a promising
accuracy gain (global: 0.6% / class: 5.6%) by switching from
DAG-RNN(4) to DAG-RNN(8), which we believe is because long-range
dependencies are better captured as information propagation
paths in DAG(8) are shorter than those in DAG(4). Such
performance benefits can be observed consistently on the CamVid
(0.5% / 2.0%) and Barcelona (1.1% / 1.6%) datasets, as evidenced
in Tables 2 and 3 respectively. Moreover, in comparison with
other representation learning nets, which are fed with much
richer contextual input (133 × 133 patches in [?], 3-scale
46 × 46 patches in [7]), our DAG-RNNs outperform theirs by a
large margin. Importantly, our results match the
state-of-the-art under this setting.

Methods                      Global    Class
--------------------------------------------
Byeon et al. [4]             70.1%     22.6%
Liu et al. [13]              74.8%     N/A
Farabet et al. [7]           78.5%     29.4%
Pinheiro et al. [?]          77.7%     29.8%
Tighe et al. [27]            79.2%     39.2%
Sharma et al. [?]            79.6%     33.6%
Shuai et al. [21]            80.1%     39.7%
Yang et al. [?]              79.8%     48.7%
CNN-65                       76.1%     32.5%
CNN-65-ENN                   76.1%     37.0%
CNN-65-DAG-RNN(4)            80.5%     42.6%
CNN-65-DAG-RNN(8)            81.1%     48.2%
--------------------------------------------
Long et al. [14]             85.2%     51.7%
VGG-conv5-ENN                84.0%     48.8%
VGG-conv5-DAG-RNN(8)         85.3%     55.7%

Table 1: Quantitative performance of our method on the SiftFlow dataset.
The numbers (in brackets) following the DAG-RNN denote the neighborhood
system of the UCG.
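
The two reported scores, Global and Class, can diverge sharply when classes are imbalanced. The sketch below computes both from flat lists of per-pixel labels; the function name and the toy labels are illustrative, and averaging only over classes that occur in the ground truth is an assumption.

```python
def global_and_class_accuracy(true_labels, predicted_labels):
    """Return (global pixel accuracy, average per-class accuracy)."""
    correct, total = {}, {}
    for t, p in zip(true_labels, predicted_labels):
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (t == p)
    global_acc = sum(correct.values()) / sum(total.values())
    class_acc = sum(correct[c] / total[c] for c in total) / len(total)
    return global_acc, class_acc


# Toy example: 8 'sky' pixels all correct, 2 'bird' pixels with one mistake.
truth = ["sky"] * 8 + ["bird"] * 2
pred = ["sky"] * 8 + ["sky", "bird"]
print(global_and_class_accuracy(truth, pred))  # (0.9, 0.75)
```

A single misclassified pixel of the rare class costs 10 points of global accuracy here but 25 points of class accuracy, which is why the class score is the more sensitive indicator of rare-class performance.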

Furthermore, we initialize our convolution layers with
VGG-verydeep-16 [23], which has been proven to be a
state-of-the-art feature extractor. The quantitative results
under setting 2 are listed in the lower part of Table 1. Our
baseline VGG-conv5-ENN surpasses the best performance of methods
under setting 1. This result indicates the significance of
large-scale data in deep neural network training. Interestingly,
our DAG-RNN(8) is still able to further improve the
discriminative power of local features by modeling their
dependencies, thereby leading to a large (6.9%) average class
accuracy boost. Note that Fully Convolutional Networks (FCNs)
[14] use activations (feature maps) from multiple convolution
layers, whereas our VGG-conv5-ENN only uses feature maps from
the conv5 layer. Hence, there is a slight performance gap
between our VGG-conv5-ENN and FCNs. Nonetheless, our
VGG-conv5-DAG-RNN(8) still performs comparably with FCNs on
global accuracy, and significantly outperforms them on class
accuracy. Importantly, our full labeling network also achieves
new state-of-the-art performance under this setting. The
detailed per-class accuracy is listed in Table 4.

^2 Based on our preliminary results, we did not observe much performance
improvement from using a larger h (e.g. 128 in CNN-65, and 256 in
VGG-conv5) on the SiftFlow dataset. In addition, networks with larger
capacity incur a much heavier computational burden.

^3 If we split the full labeling network into two disjoint parts
(CNN-65 and DAG-RNN(8)) and optimize them independently, the
corresponding accuracies are 80.1% and 42.7%. The performance discrepancy
indicates the importance of the joint optimization for the full network.

Methods                      Global    Class
--------------------------------------------
Tighe et al. [26]            78.6%     43.8%
Sturgess et al. [24]         83.8%     59.2%
Zhang et al. [?]             82.1%     55.4%
Bulo et al. [?]              82.1%     56.1%
Ladicky et al. [?]           83.8%     62.5%
Tighe et al. [27]            83.9%     62.5%
CNN-65                       84.3%     53.2%
CNN-65-ENN                   84.1%     58.1%
CNN-65-DAG-RNN(4)            88.2%     66.3%
CNN-65-DAG-RNN(8)            88.7%     68.3%
--------------------------------------------
VGG-conv5-ENN                91.0%     76.5%
VGG-conv5-DAG-RNN(8)         91.6%     78.1%

Table 2: Quantitative performance of our method on the CamVid dataset.

4.4. CamVid Dataset 

The CamVid dataset [?] contains 701 high-resolution images
(960 × 720 pixels) from 4 driving videos captured at daytime and
dusk (3 daytime and 1 dusk video sequence). Images are densely
labelled with 32 semantic classes. We follow the usual split
protocol [?] (468/233) to obtain training/testing images.
Similar to other works [?], we only report results on the 11
most common categories. According to the 85%-15% rule, 4 classes
are identified as rare, and η is 0.1.

The quantitative results are given in Table 2. Our baseline
networks (CNN-65, CNN-65-ENN) achieve very competitive results.
By explicitly modeling contextual dependencies among image
units, our CNN-65-DAG-RNN(8) brings a large performance benefit
(4.6% and 10.2% for the global and class accuracy respectively).
Moreover, our CNN-65-DAG-RNN(8) outperforms state-of-the-art
methods [?] by a large margin (4.8% / 5.8%), demonstrating the
benefit of adopting high-level features learned from CNNs and
context modeling with our DAG-RNNs. Furthermore, the
VGG-conv5-ENN alone performs excellently. Even though the
performance starts saturating, our DAG-RNN(8) is able to
consistently improve the labeling results.

4.5. Barcelona Dataset 

The Barcelona dataset [26] consists of 14871 training and
279 testing images. The size of the images varies across
different instances, and each pixel is labelled as one of the
170 semantic classes. The training images range from indoor
to outdoor scenes, whereas the testing images are only
captured from Barcelona street scenes. These issues make
Barcelona a very challenging dataset. Based on the 85%-15%
rule, 147 classes are identified as rare classes, and the
class frequency threshold η is 0.005.

Methods                 Global   Class
Tighe et al. [26]       66.9%    7.6%
Farabet et al. [7]      46.4%    12.5%
Farabet et al. [7]      67.8%    9.5%
CNN-65                  69.0%    10.5%
CNN-65-ENN              69.0%    11.0%
CNN-65-DAG-RNN(4)       71.3%    12.9%
CNN-65-DAG-RNN(8)       72.4%    14.5%
VGG-conv5-ENN           73.3%    21.1%
VGG-conv5-DAG-RNN(8)    74.6%    24.6%

Table 3: Quantitative performance of our method on the Barcelona dataset.

Table 3 presents the quantitative results. Our baseline
networks (CNN-65 and CNN-65-ENN) achieve very competitive
results that already match the state of the art. Introducing
DAG-RNN(8) yields a further improvement, so the full labeling
network sets a new state of the art under setting 1. More
importantly, under setting 2, even though the VGG-conv5-ENN
is already highly competitive, the DAG-RNN(8) still improves
its labeling performance (from 73.3% to 74.6% global accuracy
and from 21.1% to 24.6% class accuracy).

4.6. Effects of DAG-RNNs on Per-class Accuracy

In this section, we investigate the effects of our DAG-RNNs
on each class. The detailed per-class accuracy for the
SiftFlow dataset is listed in Table 4. Under setting 1, we find
that the contextual information encoded through our
DAG-RNN(8) is beneficial for almost all classes. In this case,
the local representations from CNN-65 are not strong, so their
discriminative power can be greatly enhanced by modeling
their dependencies. Accordingly, we observe a remarkable
performance benefit (+11.2%) for almost all classes. Under
setting 2, the VGG-conv5 net is pre-trained on the ImageNet
dataset, and it recognizes most classes excellently. Even
though the local representations are highly discriminative in
this situation, our DAG-RNN(8) further improves their
representative power for rare classes, for which we observe
an 8.6% accuracy gain. Under both settings, modeling the
dependencies among local features makes the classification
context-aware, so local ambiguities are mitigated to a large
extent. However, we fail to observe commensurate accuracy
improvements for extremely small and rare ‘object’ classes
(e.g. bird and bus). We conjecture that the weak local
information may have been overwhelmed by context (e.g. a
small bird is swallowed by the broad sky).
CNN-65-ENN: 94.1 84.9 77.6 74.2 80.9 61.1 30.7 71.0 27.7 34.7 21.8 63.5 ... 0.37 ... 0 ... 6.0 71.4 0 76.1 ...
CNN-65-DAG-RNN(8): 95.9 87.3 82.5 77.9 85.8 70.2 43.5 80.1 52.9 65.4 37.2 ...
... 96.0 91.1 84.4 82.9 90.4 83.5 48.8 77.5 63.5 57.6 32.6 60.2 34.9 66.0 ... 20.0 ... 44.0 38.6 45.9 26.5 33.7 14.9 50.2 1.1 0 32.7 9.1 99.9 0 84.0 48.8
VGG-conv5-DAG-RNN(8): 96.3 90.8 82.1 85.1 89.2 84.8 55.4 84.2 67.9 75.3 51.5 64.8 45.2 63.5 45.7 37.3 56.8 44.7 36.2 58.7 18.3 40.0 63.3 65.2 18.4 1.4 45.8 5.4 97.9 0 ...
Table 4: Per-class accuracy comparison on the SiftFlow dataset. All numbers are percentages. The class frequency statistics are
computed on the test images. For reading convenience, the frequent and rare classes are placed in the same block.

Figure 6: Qualitative labeling results (best viewed in color). We show, in order, input images, local prediction maps (CNN-65-ENN), contextual labeling maps
(CNN-65-DAG-RNN(8)) and the ground truth. The numbers outside and inside the parentheses are the global and class accuracy respectively.


4.7. Discussion of Modeled Dependency 

We show a number of qualitative labeling results in
Figure 6, from which we can make some interesting
observations. The DAG-RNNs are capable of (1) enforcing
local consistency: neighboring pixels are likely to be
assigned the same label. In Figure 6, the left-panel examples
show that confusing regions are smoothed by using our
DAG-RNNs. (2) ensuring semantic coherence: pixels that are
spatially far apart are usually given labels that could
co-occur in a meaningful scene. For example, the ‘desert’
and ‘mountain’ classes are usually not seen together with
‘trees’ in an ‘open country’ scene, so they are corrected to
‘stone’ in the second example of the right panel. More
examples of this kind are shown in the right panel. These
results suggest that both short-range and long-range
contextual dependencies may have been captured by our
DAG-RNNs.
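
To make the idea of propagated context concrete, the following minimal sketch runs a single directed sweep over a feature grid, where the hidden state at each position combines its local feature with the hidden states of already-visited neighbours. The southeast-only direction, the three-predecessor neighbourhood, the tanh nonlinearity, the weight names U, W and b, and the sharing of W across predecessors are all assumptions of this sketch rather than details given in this section; a full model would combine several such sweeps in different directions.

```python
import numpy as np

def southeast_sweep(x, U, W, b):
    """One directed sweep over a grid of local features.

    x: (rows, cols, D) local features; U: (K, D); W: (K, K); b: (K,).
    Returns hidden states of shape (rows, cols, K).
    """
    rows, cols, _ = x.shape
    K = b.shape[0]
    h = np.zeros((rows, cols, K))
    for i in range(rows):
        for j in range(cols):
            # Sum the hidden states of the predecessors (north, west,
            # north-west) that lie inside the grid.
            ctx = np.zeros(K)
            for di, dj in ((-1, 0), (0, -1), (-1, -1)):
                pi, pj = i + di, j + dj
                if pi >= 0 and pj >= 0:
                    ctx += h[pi, pj]
            h[i, j] = np.tanh(U @ x[i, j] + W @ ctx + b)
    return h

# Toy check of information flow on a 3x3 grid with scalar features.
U = np.array([[1.0]])
W = np.array([[0.5]])
b = np.zeros(1)
x = np.zeros((3, 3, 1))
x[0, 0, 0] = 1.0                      # perturb only the top-left feature
h = southeast_sweep(x, U, W, b)       # h[2, 2] is now non-zero
```

In this toy setting a change at the top-left position reaches the bottom-right hidden state, but not the other way round, which is why a single sweep direction alone cannot give every position access to the whole image.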


5. Conclusion 

In this paper, we propose DAG-RNNs to process
DAG-structured data, where the interactions among local
features are considered in a graphical structure. Our
DAG-RNNs are capable of encoding the abstract gist of images
into local representations, which greatly enhances their
discriminative power. Furthermore, we propose a novel class
weighting function to address the imbalanced class
distribution issue, and experiments show that it improves
the recognition of rare classes. Integrated with convolution
and deconvolution layers, our DAG-RNNs achieve
state-of-the-art results on three challenging scene labeling
benchmarks. We also demonstrate that useful long-range
contextual dependencies are captured by our DAG-RNNs, which
helps generate smooth and semantically sensible labeling
maps in practice.